
Entry Name: Purdue-Zhao-MC1

VAST Challenge 2015
Mini-Challenge 1

Team Members:

Jieqiong Zhao, Purdue University, zhao413@purdue.edu PRIMARY

Guizhen Wang, Purdue University, wang1908@purdue.edu

Junghoon Chae, Purdue University, jchae@purdue.edu

Hanye Xu, Purdue University, xu193@purdue.edu

Siqiao Chen, Purdue University, chen1722@purdue.edu

William Hatton, United States Air Force Academy, C16william.hatton@usafa.edu

Mahesh Babu Gorantla, Purdue University, mgorantl@purdue.edu

Benjamin Ahlbrand, Purdue University, bahlbran@purdue.edu

Jiawei Zhang, Purdue University, zhan1486@purdue.edu

Abish Malik, Purdue University, amalik@purdue.edu

Sungahn Ko, Purdue University, ko@purdue.edu

Sherry Towers, Arizona State University, smtowers@asu.edu

David Ebert, Purdue University, ebertd@purdue.edu

Student Team: NO

Did you use data from both mini-challenges? YES

Analytic Tools Used:

Our custom-designed system, developed for the challenge, and an R package.

We applied three clustering algorithms. Please refer to the Appendix at the bottom of this document.

Approximately how many hours were spent working on this submission in total?

100.

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2015 is complete? YES

Video Download

Video:

http://pixel.ecn.purdue.edu:8080/~zhan1486/VASTCHALLENGE15/MC1.wmv

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Questions

MC1.1 – Characterize the attendance at DinoFun World on this weekend. Describe up to twelve different types of groups at the park on this weekend.

a. How big is this type of group?

b. Where does this type of group like to go in the park?

c. How common is this type of group?

d. What are your other observations about this type of group?

e. What can you infer about this type of group?

f. If you were to make one improvement to the park to better meet this group’s needs, what would it be?

Limit your response to no more than 12 images and 1000 words.

In the provided MC1 dataset with so many different actors (IDs) and aspects, there is a medley of ways to group people together. The clustering ability of our tool allowed us to highlight the following groups. We applied three different clustering algorithms to define groups for different aspects.

We group people who prefer the same attraction and spend a majority of their time there. These results are obtained using the k-means clustering technique. The results are shown as a line graph (Figure 1, left), where the x-axis shows the attractions and the y-axis shows the percentage of time that people within the cluster spend at each attraction. The results are also shown as a node graph (middle), where the distance between nodes indicates how similar the clusters are. The number of individuals in each cluster is shown on the right of the image.

Figure 1: Results of k-means clustering based on attraction preferences.

a. How big is this type of group?

Among the many groups with different attraction-visiting patterns, we pick the groups that spend a lot of time at Sabre Tooth Theatre:

Cluster 2 with 619 customers on Friday,

Cluster 2 with 498 customers on Saturday,

Cluster 3 with 1383 customers on Sunday

b. Where does this type of group like to go in the park?

Attraction 64

c. How common is this type of group?

We find this type of group to be prevalent every day.

d. What are your other observations about this type of group?

A crime was discovered on Sunday and the police closed the stage, so people who would otherwise have seen Scott’s stage show went to the shows presented at attraction 64 instead. This may be why the size of this group rose sharply on Sunday (from 498 on Saturday to 1383 on Sunday).

These people do not prefer going to the stage to see Scott’s show: their check-in frequency and the time they spend at 63, where Scott gives the show, are comparatively low.

e. What can you infer about this type of group?

These people like shows very much.

f. If you were to make one improvement to the park to better meet this group’s needs, what would it be?

We would recommend putting the places that host shows closer to one another.

Figure 2: Heatmap of cluster 2 on Saturday from 8:00 to 23:30.

BASED ON K-MEANS on attraction categories

Figure 3: K-means results based on attraction categories for three days.

We apply k-means clustering to the fraction of time each person spent in every attraction category, together with the check-in time and movement around that attraction category. This method provides a view based on aggregated categories. We observe that k-means clusters people into 5 groups. The ticks on the x-axis are the attraction categories: thrill rides, kiddie rides, rides for everyone, food, restrooms, beer gardens, shopping, shows, information, exit and entry.
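
The time-fraction feature described above can be illustrated with a minimal sketch. It assumes the stay durations have already been derived from the raw data; the record layout `(visitor_id, attraction_id, seconds_spent)`, the function name, and the `CATEGORY` mapping are hypothetical and not taken from our system. It covers only the time-fraction part of the feature vector, not the check-in time or movement features.

```python
from collections import defaultdict

# Hypothetical attraction-to-category mapping (only the stage, 63, is taken
# from the text; the other IDs are placeholders for illustration).
CATEGORY = {1: "thrill rides", 2: "thrill rides", 10: "kiddie rides", 63: "shows"}

def category_time_fractions(stays, category=CATEGORY):
    """stays: iterable of (visitor_id, attraction_id, seconds_spent).

    Returns {visitor_id: {category: fraction of that visitor's total time}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for visitor, attraction, seconds in stays:
        totals[visitor][category[attraction]] += seconds
    fractions = {}
    for visitor, per_category in totals.items():
        whole = sum(per_category.values())
        # Visitors with no recorded time get an empty profile.
        fractions[visitor] = (
            {c: s / whole for c, s in per_category.items()} if whole else {}
        )
    return fractions
```

Each visitor's dictionary can then be laid out as a fixed-order vector over the categories and passed to a clustering routine.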

Group 1: The cluster drawn with a black line on all three days. These people spend much of their time checking in at thrill rides and a little time checking in at rides for everyone.

a. Group 1 is one of the two largest groups. Friday: 804, Saturday: 1796, Sunday: 1936

b. They like thrill rides.

c. This group appears every day.

d. This group is not very interested in Scott-related activities, since they spend a very small fraction of their time in the shows category.

e. This group may be young people, since they enjoy thrill rides and rides for everyone.

f. Thrill rides seem to be one of the most popular attraction categories, but they are currently located quite far from each other. The park could place thrill rides closer together to reduce travel time.

Group 2: The cluster drawn with a green line (Friday and Sunday) or a blue line (Saturday). These people spend much of their time in the shows category.

a. This group is medium-sized. Friday: 734, Saturday: 1279, Sunday: 873

b. This group enjoys both rides for everyone and attending Scott’s show.

c. This group appears every day.

d. Compared to other groups, this group is very interested in Scott’s activities, but they also enjoy spending time on thrill rides and rides for everyone.

e. This group may be Scott’s fans who came to the park mainly for Scott.

f. People who like shows still have some interest in thrill rides and rides for everyone. The park could schedule the shows for times when thrill rides and kiddie rides have a large volume of visitors, in order to balance the visitor volume among the three categories.

Group 3: The cluster drawn with an orange line on all three days. These people spend more time in kiddie land.

a. This is a smaller group. Friday: 321, Saturday: 1984, Sunday: 589

b. Group 3 shows more interest in kiddie rides, suggesting that its members bring their children along.

c. This group appears every day.

d. This group is not very interested in Scott’s activities.

e. This group may be families with children.

f. Families with children seem to like rides for everyone and thrill rides in addition to kiddie land. The park should make sure that rides for everyone and thrill rides also have some facilities for children.

Next, we applied sequence-based clustering, which considers the order of check-ins.

Figure 4: Bar chart representation for sequence clustering results for three days.

Among the clusters generated by C3, we present the top 10 clusters (ranked by the number of customers in each cluster) in Figure 4. The figure shows the result of grouping people by the check-in sequence of attraction categories that they visited. The height of each row encodes the number of people within the cluster. The color of each bin within a cluster encodes an attraction category. We confirm that the groups on each day enjoy different types of attractions in different orders.

The largest group on Friday, with 1619 customers, visited 14 attractions, while the largest groups on Saturday (4765 customers) and Sunday (6839 customers) visited 20 and 23 attractions, respectively. In Figure 4, the group of customers who visited the largest number of attractions is easily recognizable (e.g., 41 customers visited 28 attractions on Saturday). In addition, customers on Friday visited the fewest attractions on average.

2. People who come to the park together and remain together throughout the entire day; in other words, they go to the same places at exactly the same time. Our tool clusters visitors by check-in sequence to identify these groups. If people check in at the same locations in the same order throughout the day, we believe they are travelling together.

a. The size of this type of group ranges from 2 to 40 people. The average size is 4-6 people.

b. These groups have no set preference in the park. Since this type of group applies to almost everyone in the park, there are no clear inclinations for any specific attraction.

c. This type of group is quite common. A strong majority of IDs in the dataset come to the park with at least one other person and travel with these other people throughout the day.

d. No other observations.

e. We can infer that this type of group consists of friends, families, school field trips, etc. coming to the park. Generally, people do not go to amusement parks alone. The people in a group of this type arrived at the park with a set group of people on purpose and intended to enjoy the park with them at all times.

f. The park could better meet the needs of these groups by widening the paths in the park so everyone can walk closer together. Although it is clear these people always traveled together, a slight difference in their movement totals occurs, possibly because they do not change grid squares at the same time when they cannot walk side by side.

WE DO CROSS-VALIDATION BETWEEN CLUSTERING RESULTS AND MC-2 DATA HERE.

3. People who do not travel together but communicate with each other by messaging.

Figure 5: Node graph of sequence clustering.

a. This type of group often occurs between several smaller groups who are travelling together. The average size of this group is 7-8 people, but those people may be split into smaller groups of 2 to 3 people who travel together. For example, the IDs 163330, 268563, 513541, 651950, 725559, 825258, 879813, 1375106, and 2056236 all communicate with each other, but do not all travel together. Rather, these IDs fit into only three groups that travel together. In Figure 5, these 9 IDs break into three groups highlighted in grey in the k-means graph, which means they fall into three clusters of interests.

b. Again, this type of group has no preference in the park because they travel all over.

c. There are at least 5 such groups that come to the park every day and between 10 and 15 groups for each individual day.

d. The individuals in these groups sometimes have overlapping check-in times at certain attractions, which could mean they are meeting with the others that they communicate with, but do not travel with.

e. The people in this group are likely those who know each other and split up into smaller groups at the park based on preference, or simply meet each other at the park, become friends, and continue to communicate that day.

f. This group does not necessarily have a need, but the park could enhance their experience by showing the locations of the friends that they most communicate with.

4. People who travel together, but do not communicate by messaging. We found many clusters of customers traveling together, but based on the MC-2 data we did not find customers within the same cluster communicating with each other.

5. Park personnel: This group includes the park employees and security personnel. For example, we find that the person with ID 1278894 sends out large group messages throughout the day. We believe that this person is a park employee who sends information to visitors who are not currently checked in to any attractions. Furthermore, we hypothesize that the person with ID 839736 is a member of park security who is disseminating information to other park personnel and/or the general public after the crime was discovered on Sunday.
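
The cross-check between communication groups (item 3) and travel groups (item 2) can be sketched as follows. This is an illustrative outline rather than the code in our tool: it assumes messages are available as `(sender_id, receiver_id)` pairs and that each visitor already carries a travel-group label, and all function names are hypothetical. Broadcast senders such as the park personnel in item 5 would need to be filtered out first, since they would otherwise link almost everyone together.

```python
from collections import defaultdict

def communication_groups(messages):
    """Connected components of the message graph, via union-find.

    messages: iterable of (sender_id, receiver_id) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in messages:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())

def split_by_travel(comm_group, travel_label):
    """Split one communication group into the travel groups it contains.

    travel_label: {visitor_id: label}; unlabeled visitors stay on their own.
    """
    parts = defaultdict(list)
    for visitor in sorted(comm_group):
        parts[travel_label.get(visitor, ("solo", visitor))].append(visitor)
    return sorted(parts.values())
```

A communication group that splits into several parts corresponds to the pattern in item 3: people who message each other but move through the park in separate smaller groups.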

MC1.2 – Are there notable differences in the patterns of activity in the park across the three days? Please describe the notable difference you see.

Limit your response to no more than 3 images and 300 words.

Figure 6: Heatmap based on check-in data of all visitors during Scott's show.

The most notable difference is the lack of check-ins at the pavilion (Building #32) and the performance stage (area #63) on Sunday after 12:00, as shown in boxes (a) and (b) of Figure 6. On Friday and Saturday, Scott Jones performed a show at the stage at 10:00 AM and 3:00 PM, which garnered large check-in totals between 9:00 and 10:00 AM and between 2:00 and 3:00 PM. On Sunday, we see the same pattern for the 10:00 AM show; however, it is missing for the 3:00 PM show. The heat maps from our tool in Figure 6 illustrate the popularity of each check-in location during a specific time frame. On Friday and Saturday, it is clear the performance stage is popular for check-ins between 2:00 and 3:00 PM, but Sunday’s heat map lacks any heat signature, showing that no one checked in at the stage at that time. The line graphs also show the reduced number of check-ins.

In conjunction with the lack of a performance Sunday afternoon, there is also an absence of any check-ins to the pavilion after 12:00 PM. On the first two days of the weekend, the pavilion is one of the most popular attractions outside the time Scott Jones is performing. Contradicting this trend, there are zero check-ins to the pavilion Sunday afternoon and evening. Thus, unlike the beginning of the weekend, the pavilion was closed for some time.
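
The day-by-day comparison described above amounts to counting check-ins per location and hour and looking for days where a normally busy slot is empty. A minimal sketch, assuming check-ins are available as `(timestamp, location_id)` pairs with timestamps formatted as `YYYY-MM-DD HH:MM:SS`; this layout, the dates in the usage note, and the function names are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

def hourly_checkins(records):
    """Count check-ins per (day, location_id, hour).

    records: iterable of (timestamp, location_id).
    """
    counts = Counter()
    for timestamp, location in records:
        t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        counts[(t.date().isoformat(), location, t.hour)] += 1
    return counts

def silent_days(counts, location, hour):
    """Days present in the data with no check-in at `location` during `hour`."""
    days = sorted({day for day, _, _ in counts})
    # A Counter returns 0 for missing keys, so absent slots read as silent.
    return [day for day in days if counts[(day, location, hour)] == 0]
```

For example, `silent_days(counts, 63, 14)` would list the days on which nobody checked in at the stage between 2:00 and 3:00 PM.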

Figure 7: Check-in numbers of all visitors from 14:00 to 15:00 for three days.

Also, IDs 644885 and 521750 frequent the park every day together and go to the performance stage for each show Scott Jones completes. On Sunday, these two leave the park after the first show and do not return for the second, confirming the second show was cancelled on Sunday.

Figure 8: Trajectories of two IDs (644885, 521750) for three days.

MC1.3 – What anomalies or unusual patterns do you see? Describe no more than 10 anomalies, and prioritize those unusual patterns that you think are most likely to be relevant to the crime.

Limit your response to no more than 10 images and 500 words.

While a vast majority of the visitors to the park participate in similar activities and generally adhere to expected activity, there are several anomalies which illustrate people breaking the norms of the park.

1. One of the largest breaks from the norm is the people who check in only at the entrance to the park. Between 20 and 30 IDs throughout the weekend simply check in at the park entrance and then fail to check in at any rides or attractions. For example, the IDs listed below are the customers who checked in to the park on Friday, but did not check in at any other place.

Figure 9: Users who only check in at entrances on Friday.
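
A filter of this kind can be sketched in a few lines. The sketch assumes check-ins are given as `(visitor_id, location_id)` pairs and that the set of entrance location IDs is known; both the layout and the function name are hypothetical.

```python
from collections import defaultdict

def entrance_only(checkins, entrance_ids):
    """Visitors whose every check-in is at an entrance.

    checkins: iterable of (visitor_id, location_id).
    entrance_ids: set of location IDs that are park entrances.
    """
    visited = defaultdict(set)
    for visitor, location in checkins:
        visited[visitor].add(location)
    # Keep visitors whose set of visited locations is a subset of the entrances.
    return sorted(v for v, locations in visited.items() if locations <= entrance_ids)
```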

2. There are a number of IDs with few check-ins who seem to visit the park’s attractions, but do not check in and use them. In reality, these people may be grandparents or supervisors escorting others to the rides, or watching over them without any desire to participate. However, some people have few check-ins because they are busy with park attractions that are not rides, such as the Beer Gardens, Restaurants, or Restrooms.

3. One person, ID 392618, shows anomalous activity. The person came to the park between 9 and 10 and acted normally until 2:55 PM at the show area, as shown in the heatmap below. Then, the person moved within the same area for 6 hours without any check-ins and suddenly jumped to building 37 around 8:43 PM, as shown in the trajectory view (right black box).

Figure 10: Abnormal movement analysis of User 392618.

4. The small number of check-in anomalies also helped reveal two IDs (521750 and 644885) who followed a strange pattern. To start, they actually left the park during the day and then checked back in at the entrance when they returned in the afternoon. Very few IDs left the park and came back within the same day, but these two traveled together and left and returned at the same time. The two people walked around the park toward the performance stage (#63), waited two hours outside without a check-in or any movement (a half hour before and after Scott’s performance), and then walked around the park and exited at 12:15 PM. They returned at 1:45 PM each day and repeated the same pattern of movement for the afternoon show, departing the park at 5:15 PM. No other IDs reproduced this behavior. The picture below shows the movement data for these two IDs on the afternoons of the three days, from 1:00 PM to 11:00 PM. They did not come back to the park on Sunday afternoon.

Figure 11: Trajectories of two IDs (644885, 521750) for three days from 13:00 to 23:00.
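
Visitors who leave and re-enter on the same day can be flagged by counting entrance check-ins per visitor per day. The sketch below assumes check-ins are given as `(day, visitor_id, location_id)` triples and that entrance location IDs are known; these names and the layout are illustrative assumptions.

```python
from collections import Counter

def reentries(checkins, entrance_ids):
    """Sorted (day, visitor_id) pairs with more than one entrance check-in that day.

    checkins: iterable of (day, visitor_id, location_id).
    """
    counts = Counter(
        (day, visitor) for day, visitor, location in checkins
        if location in entrance_ids
    )
    return sorted(key for key, n in counts.items() if n > 1)
```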

8. A few people check-in to the pavilion more than 5 times. These are shown below:

Figure 12: Visitors who check in at pavilion more than 5 times.

10. User ID=1711922 was the person who spent the most time at the pavilion. This person does not check in at any attractions; the only check-in is at the exit.

Appendix

Clustering algorithm 1)

We use k-means, a mature and fast algorithm, to cluster data points. We consider the time customers spent at the 42 attractions where customers check in, and aim to find groups of customers based on their attraction preferences. One example cluster consists of Mr. Scott’s fans, who tend to spend most of their time at Creighton Pavilion and the Stage to see the shows. The node-link graph, generated from the distances among nodes, presents the similarity among clusters.
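
As a rough illustration of this step, the sketch below runs a plain Lloyd's k-means on per-visitor time-fraction vectors. It is not the implementation used in our system: the initialization (random sample with a fixed seed), the squared Euclidean distance, and the three-attraction toy data are assumptions made for the example.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's k-means on equal-length lists of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each visitor joins the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Each row: fraction of one visitor's time at three hypothetical attractions.
visitors = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.05, 0.05, 0.9]]
centroids, labels = kmeans(visitors, k=2)
```

In the real setting each vector would have one entry per check-in attraction (42 in total), and the pairwise distances between the resulting centroids could drive a node-link layout.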

Clustering algorithm 2)

We implement sequence-based clustering to group people by their check-in sequences over attraction categories. In this approach, we first find the longest common subsequence (LCS) to measure the similarity between customers’ sequences. Then, we apply a density-based clustering algorithm, DBSCAN, to group customers.
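
The two steps can be sketched as below. This is a simplified outline under stated assumptions, not our production code: the normalization of the LCS length into a distance (`1 - LCS / max length`), the `eps` and `min_pts` parameters, and the brute-force neighbour search are choices made for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_distance(a, b):
    """Assumed normalization: 0 for identical sequences, 1 for nothing in common."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else 1.0 - lcs_length(a, b) / longest

def dbscan(items, eps, min_pts, dist):
    """Minimal DBSCAN over arbitrary items; returns labels, with -1 marking noise."""
    n = len(items)
    # Brute-force neighbourhoods; each item counts as its own neighbour.
    neighbours = [[j for j in range(n) if dist(items[i], items[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # expand only through core points
    return labels
```

Calling `dbscan(sequences, eps, min_pts, lcs_distance)` on a list of category sequences returns one cluster label per customer. The pairwise distance computation is quadratic in the number of customers, so a real run over thousands of visitors would need caching or deduplication of identical sequences.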

Clustering algorithm 3)

We also implement a check-in based clustering approach where people are clustered based on the same check-in sequences among attractions.
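
A minimal sketch of this idea groups visitors whose check-in records match exactly. It assumes check-ins are available as `(timestamp, visitor_id, attraction_id)` triples and treats two visitors as travelling together only when both the attractions and the timestamps agree; the record layout, the exact-match criterion, and the function name are assumptions for illustration.

```python
from collections import defaultdict

def travel_groups(checkins):
    """Group visitors with identical time-ordered (timestamp, attraction) lists.

    checkins: iterable of (timestamp, visitor_id, attraction_id).
    Returns a sorted list of groups (each a sorted list of visitor IDs) of size >= 2.
    """
    per_visitor = defaultdict(list)
    for timestamp, visitor, attraction in checkins:
        per_visitor[visitor].append((timestamp, attraction))
    groups = defaultdict(list)
    for visitor, visits in per_visitor.items():
        # The sorted visit list serves as a hashable signature of the day.
        groups[tuple(sorted(visits))].append(visitor)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

Relaxing the exact timestamp match to a small tolerance would be a natural variation if group members check in a few seconds apart.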

Comparison between Clustering 2 and Clustering 3: Clustering 3 groups customers who travel together by using check-in data, while Clustering 2 groups customers by their preferences for attraction categories (e.g., thrill rides), extracted from the sequences of their visits to attraction categories.
